# Load packages.
library(tidyverse)
Bayesian calibration, on the other hand, can, in principle, handle
what history matching can’t, but it requires us to have done a good job
of incorporating the likelihood of the aforementioned outputs in the
prior. This distinction between history matching and Bayesian
calibration might be pithily summarised as “History matching is a
quicker and easier way to learn about a little, while Bayesian
calibration is a time-consuming and more-difficult way to learn about a
lot”.
Edwards,
Cameron and Rougier suggest that history matching can be a useful
pre-calibration step to sensibly inform the prior for Bayesian
calibration, but this is only reasonable if the desirable responses of
the outputs to the inputs are centred within the Not Ruled Out Yet space.
By analogy, sure, it seems sensible to start looking for your lost keys
under the streetlight at night, but that doesn’t mean they aren’t
actually somewhere in the dark.
History-matching methods are just fancy ways to get rid of
input values that give ridiculous output values. You do this by
considering what kind of previously-used input values gave ridiculous
output values, and deciding not to carry them forward.
For example, if you’ve ever made a meal and added too much
salt, you will have learned how much salt makes for a bad meal. Next
time you make the meal and you are unsure about how much salt to add,
you will consider your previous (a.k.a. historical) salting and say
“Well, there is definitely no point in adding more salt than I did
last time”. Congratulations, you have just applied history matching
to constrain the range of inputs (i.e. salt) so that the outputs
(i.e. tastiness of your meal) are more likely to be tasty / desirable /
sensible / realistic!
History matching is often applied to Gaussian processes,
and in this blog I will do the same. You can read more about
Gaussian processes in my other explainer, here.
(Side note: check out this Art
of Manliness podcast episode with Daniel Holzman to learn a lot
about appropriate salting and more.)
* So, what is the motivation behind history matching? *
To understand history matching, I found Williamson et al.’s idea of the Not Ruled Out Yet space to be useful. The Not Ruled Out Yet space is the space of inputs (i.e. the scope of possible values for the inputs) that remains after I’ve gotten rid of the input values that produce ridiculous and unlikely outputs. (Adrianakis et al. have a similar idea, called the “non-implausible region”.)

We can come up with any number of ways to decide what should be ruled in and ruled out. For example, the implausibility measure used by Williamson et al. is the absolute difference between the average predicted output value and the observed output value, scaled by the standard deviation of all these differences:
\[ I(\ parameter\ value_{j}\ ) = \max_{i}\{\ I_{i}(parameter\ value_{j})\ \} \]
where
\[
I_{i}(parameter\ value_{j}) = \frac{ |\ observation_{i} - E[\
prediction_{i}( parameter\ value_{j})\ ]\ | } {\sqrt{ Var[\
observation_{i} - E[\ prediction_{i}(parameter\ value_{j})\ ]\ ] }}
\]
and where \(i\) refers to runs of the model, and \(j\) refers to
particular parameter values. You use the implausibility measure to rank
parameter values, then trim off the worst.
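To make the measure concrete, here is a toy, hand-rolled version of the implausibility calculation above. All the numbers are made up for illustration: rows of the matrix stand for model runs \(i\), columns for candidate parameter values \(j\).

```r
# Toy implausibility calculation (all numbers made up).
set.seed(1)
observation <- c(10, 12, 11)                        # one observation per run i
prediction <- matrix(rnorm(15, mean = 11, sd = 2),  # E[prediction_i(value_j)]
                     nrow = 3, ncol = 5)
diffs <- observation - prediction                   # observation_i - E[prediction_i]
# Scale by the standard deviation of all the differences, then take the
# maximum over runs i for each parameter value j:
implausibility <- apply(abs(diffs) / sd(diffs), 2, max)
# Rank the parameter values and keep, say, the least-implausible 60%:
not_ruled_out_yet <- which(implausibility <= quantile(implausibility, 0.6))
```

The keep-the-best-60% cut-off is arbitrary here; as discussed below, the choice of rule matters.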
Any choice of rule comes with trade-offs. The implausibility measure, for example, can be compared across different predictive models even with different outputs because it is a unitless, scaled score. But, if its application means it always trims the most-extreme scores, then repeated application (called “refocussing”) can result in a situation like Zeno’s dichotomy paradox and can mean the Not Ruled Out Yet space vanishes. To prevent this, some absolute, minimum size of the Not Ruled Out Yet space needs to be set…but couldn’t we just have set an absolute, maximum accepted difference in the first place?
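The Zeno-like problem can be sketched with made-up scores: if each wave of refocussing always trims the worst 20%, the Not Ruled Out Yet set shrinks geometrically and never settles on its own.

```r
# Refocussing by always trimming the worst 20% (made-up scores).
set.seed(1)
scores <- abs(rnorm(1000))   # pretend implausibility scores
sizes <- integer(0)
for (wave in 1:10) {
  scores <- scores[scores <= quantile(scores, 0.8)]  # trim the worst 20%
  sizes <- c(sizes, length(scores))
}
sizes   # roughly 1000 * 0.8^wave: ever smaller, never done
```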
History matching is also known as Monte Carlo filtering, which Saltelli et al. (2008, page 248) summarise as
“one samples the space of the input factors as in ordinary [Monte Carlo], and then categorizes the corresponding model output as either within or without the target region (the terms behaviour, \(B\), or nonbehaviour, \(\bar{B}\), are used). This categorization is then mapped back onto the input factors, each of which is thus also partitioned into a behavioural and nonbehavioural subset.”
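Saltelli et al.’s description can be sketched in a few lines of R. The model and the target region here are both made up for illustration:

```r
# Minimal Monte Carlo filtering sketch (toy model, made-up target region).
set.seed(1)
input <- runif(500, min = 0, max = 10)      # ordinary Monte Carlo sample
output <- input^2 + rnorm(500, sd = 5)      # toy model
behavioural <- output >= 20 & output <= 60  # target region for the output
# Map the categorization back onto the input factor:
summary(input[behavioural])                 # behavioural subset, B
summary(input[!behavioural])                # nonbehavioural subset, B-bar
```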
* What is Bayesian calibration? *

I think that distinction between
history matching and Bayesian calibration is somewhat analogous to the
distinction between fixed effects and random effects in regression
analysis. The analogy I’m alluding to is that estimating regression
coefficients for fixed effects is like saying “This is the estimate;
other estimates are not acceptable”, and filtering parameter values
using history matching is like saying “These parameter values are
acceptable. Other parameter values are not acceptable”.
Similarly, estimating regression coefficients for random effects is like
saying “This is the distribution of acceptable coefficients, but I’m
not saying any particular one is best”, and filtering parameter
values using Bayesian calibration is like saying “This is the
distribution of acceptable parameter values, but I’m not saying any
particular one is best”.
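The contrast can be sketched with toy numbers (everything below is made up): history matching returns a hard accept/reject set of parameter values, while a calibration-style approach returns a weight for every value without declaring one winner.

```r
# Hard filter versus weighted distribution (all numbers made up).
candidates <- seq(0, 1, by = 0.01)
misfit <- abs(candidates - 0.3)       # pretend misfit score
accepted <- candidates[misfit < 0.2]  # history matching: in or out
weights <- exp(-misfit^2 / 0.02)      # calibration-style weights ...
weights <- weights / sum(weights)     # ... normalised to sum to 1
range(accepted)                       # a set with hard edges
```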
Of course, there are many
differences between fitting regression coefficients and filtering
parameter values in simulation models. I present the previous analogy to
share something that helped me on my journey to understand all these
concepts. Analogies are temporary rafts to help us across rivers of
confusion; they need not be perfect, and they can be left behind once we
reach firm ground on the other side.
For more on
Bayesian calibration, check out…
…
A general rule for life: never rely solely on a seller’s word about the performance of some tool. Also, never forget that, in theory, theory and practice are the same but, in practice, they aren’t. Thankfully, there are sensible ways to assess how much our history matching might have helped. As a general warning, beware of academic clickbait, i.e. manipulation by relative and absolute statistics.
History matching assumes that outputs from previous models are
sufficient proxy observations of the truth, so we should use them to
constrain our wider understanding of the truth. But what if the outputs
from our previous models poorly represent the truth? Selection bias is a
beast!
I think selection bias is easiest to understand as
collider bias using directed acyclic graphs, which is beyond the scope
of this blog. Check out figure 1 in Griffith et
al. for a simple illustration of how selection bias changes our
perception of a relationship (URL in code box):
knitr::include_graphics("https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41467-020-19478-2/MediaObjects/41467_2020_19478_Fig1_HTML.png?as=webp")
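A small simulation (toy data) shows the same effect: two independent traits become negatively correlated once we only look at cases selected on their sum.

```r
# Selection bias as collider bias, simulated with toy data.
set.seed(1)
x <- rnorm(5000)
y <- rnorm(5000)               # independent of x
selected <- (x + y) > 1        # selection depends on both: a collider
cor(x, y)                      # approximately zero in the full data
cor(x[selected], y[selected])  # clearly negative among the selected
```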
Williamson et al. hint at selection bias when they say the following about ensemble filtering:
“The issue with this approach is that it does not account for the uncertainty in the unsampled, high dimensional parameter space”.
But the authors don’t acknowledge this issue for history matching;
why not? Perhaps I am missing something. Please let me know.
(Side note: Selection bias isn’t merely us focussing on a subset of
situations. Focussing our analyses is permissible if we acknowledge that
we are doing it. If you don’t realise that you have selected a subset of
situations and you make statements about the whole, then your inferences
are incorrect – plain and simple.)
Another problem I have with history matching - and indeed any attempt to infer beyond what we actually observe - is that it assumes the outputs respond to the inputs linearly, or at least smoothly and monotonically. By disregarding all input values beyond our Not Ruled Out Yet space, we don’t permit non-linear effects, reversals of effects, or pockets of sensible outputs after a gap of nonsensical ones. Below are three examples:
# Prepare data
webscrape <- read.csv("https://data.giss.nasa.gov/gistemp/graphs/graph_data/Global_Mean_Estimates_based_on_Land_and_Ocean_Data/graph.txt")
df <-
webscrape %>%
tidyr::separate_rows(colnames(webscrape), sep = ", ") %>%
dplyr::slice(4:n()) %>%
tidyr::separate(colnames(webscrape), sep = "\\s+", into = c("Year", "No_Smoothing", "Lowess(5)")) %>%
dplyr::select(-`Lowess(5)`) %>%
dplyr::mutate_if(is.character, as.numeric)
# Add the 1951-1980 baseline value
# https://earthobservatory.nasa.gov/world-of-change/decadaltemp.php
df$No_Smoothing <- df$No_Smoothing + 14
# Fit a linear model to data from before 1920.
df <-
df %>%
dplyr::filter(Year < 1920) %>%
lm(formula = No_Smoothing ~ Year, data = .) %>%
predict(newdata = df %>% dplyr::select(Year)) %>%
as.data.frame() %>%
`colnames<-`("trend") %>%
dplyr::bind_cols(df)
# Plot.
(p_tempExtrap <-
df %>%
ggplot() +
geom_point(aes(x = Year, y = No_Smoothing),
size = 3, color = "grey") +
geom_line(aes(x = Year, y = trend),
linewidth = 3) +
ylab("Global average\nsurface temperature (Celsius)") +
labs(caption = "Data source: https://climate.nasa.gov/vital-signs/global-temperature/") +
theme(text = element_text(size = 20))
)
knitr::include_graphics("https://bpb-eu-w2.wpmucdn.com/blogs.bristol.ac.uk/dist/e/692/files/2022/01/PlotTwo-1024x512.png")
knitr::include_graphics("https://i.makeagif.com/media/6-28-2023/O09D6n.gif")
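The third failure mode - a pocket of sensible outputs beyond a gap of nonsensical ones - can be sketched with a made-up, non-monotone function. Trimming the input range at the first bad outputs would also discard the later sensible pockets.

```r
# Pockets of sensible outputs separated by gaps (toy function).
input <- seq(0, 10, by = 0.1)
output <- input * sin(input)   # non-monotone toy model
sensible <- abs(output) < 2    # pretend |output| < 2 is "sensible"
kept <- input[sensible]
# Count contiguous pockets of sensible inputs (a jump > 0.15 marks a gap):
pockets <- cumsum(c(1, diff(kept) > 0.15))
max(pockets)                   # more than one pocket survives
```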
The gist of Williamson et al.’s idea of the Not Ruled Out Yet space is that we throw out the input values that produce ridiculous and unlikely outputs. But, because the implausibility measure takes the maximum over model runs, a single misbehaving output is enough to rule an input value out, even if its other outputs were perfectly sensible. It seems very conservative to throw out any input values that misbehaved a little.